Content
1. Introduction
The research is based on the the effectiveness of a promotion - Type A. The goal of the study is to derive insights from the available ‘Customer Full Journey’ datasets relating to customers, transactions, and how these factors influence the sales when the promotion was active. As part of our methodology we plan to clean, tidy, explore, and analyze the datasets we received to finally answer the following questions of interest -
Question #1: Was the customer responsive to the promotion/campaign ran during a specific period?
Question #2: Was holiday a driver of sales or was it the promotion that helped in acquiring new customer, retaining existing customer and overall increasing customer enagement?
Question #3: To what degree was the promotion effective?
Question #4: How were the customer purchase pattern different when the promotion was active VS. when the promotion was inactive?
Question #5: What was the Expenditure Per Hour (EPH) when the promotion was active VS. when the promotion was inactive?
Question #6: What are the common demographics of the customers who displayed disparate purchasing pattern when the promotion was active VS. when the promotion was inactive?
Question #7: What are the top product categories that experienced uptick/decline in sales and sales volume?
Our methodology can be broadly classified into three steps -
Data cleaning: Clean and tidy the tables/datasets in R, and make data ready for analysis
Data Exploration: Explore the nature and source of data, explore missing values and reasons ; identify abnormal values and outliers, and visualize variable distributions.
Data Analysis: Applying relevant joins, filters, aggregation methods and calculations to derive KPIs and effective measure to derive meaningful insights. We propose a comparative analysis of sales and sales volume, followed by descriptive insights on the demographics and products.
This research will help the client in two major ways. Firstly, understand better as to what factors impact the success of a promotion and better understand the demographics and product category influenced by the promotion. The insights can be further used to hyperpersonalize marketing initiatives to target specific customer segment in the future. Secondly, design better promotions against different product types and demographic features of the customers powered by the knowledge of customer engagement (and spending) levels with various product categories.
2. Packages
Packages to be installed
Following Packages are needed for the project-
completejourney - Retail shopping transactions for
2,469 households over one year;
tidyverse - Included for data tidying ; Functions-To
use gather and spread function
ggplot2 - To create Data Visualizations
plotly - To create Data Visualizations
patchwork - To combine separate ggplots into the same
graphic
lubridate - To analyse Date-Time data
viridis - To use different color palettes
knitr - To do formatting in R markdown
kableExtra - To do formatting in R markdown
reshape2 - To reshape data
ggalluvial - To produce alluvial plots in a tidyverse
framework
suppressWarnings(suppressMessages(library(completejourney)))
suppressWarnings(suppressMessages(library(tidyverse)))
suppressWarnings(suppressMessages(library(plotly)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(patchwork)))
suppressWarnings(suppressMessages(library(lubridate)))
suppressWarnings(suppressMessages(library(viridis)))
suppressWarnings(suppressMessages(library(reshape2)))
suppressWarnings(suppressMessages(library(ggalluvial)))
3. Data Sources & Pre-Processing
3.1. Data Source : The original data on the retail client was downloaded from the Data Wrangling course folders
3.2.1 Introduction of source data: The dataset primarily tracks the transaction level data from 2,500 households who frequently shop at the client’s stores. Moreover, the data captures purchases made across the entire gamut of product categories available at the retailer’s stores. Among other important information, we have customer demographics data and direct marketing campaigns data available for certain households. (Please Note: The terms ‘Customer’ and ‘Household’ have been assumed synonymous and used interchangeably in this report)
Timeline: The data tracks customer transactions over a period of one year (52 weeks).
3.2.2 Explanation of source data: (tables and variables) In total, this comprehensive total customer journey had 8 different data frames. Each of these are described in brief below. Also a small glimpse into each of these data is given below
a) campaigns: campaigns received by each household
| campaign_id | household_id |
|---|---|
| 1 | 105 |
| 1 | 1238 |
| 1 | 1258 |
| 1 | 1483 |
| 1 | 2200 |
b) campaign_descriptions: campaign metadata (length of time active)
| campaign_id | campaign_type | start_date | end_date |
|---|---|---|---|
| 1 | Type B | 2017-03-03 | 2017-04-09 |
| 2 | Type B | 2017-03-08 | 2017-04-09 |
| 3 | Type C | 2017-03-13 | 2017-05-08 |
| 4 | Type B | 2017-03-29 | 2017-04-30 |
| 5 | Type B | 2017-04-03 | 2017-05-07 |
c) coupons: coupon metadata (UPC code, campaign, etc.)
| coupon_upc | product_id | campaign_id |
|---|---|---|
| 10000085207 | 9676830 | 26 |
| 10000085207 | 9676943 | 26 |
| 10000085207 | 9676944 | 26 |
| 10000085207 | 9676947 | 26 |
| 10000085207 | 9677008 | 26 |
d) coupon_redemptions: coupon redemptions (household, day, UPC code, campaign)
| coupon_upc | product_id | campaign_id |
|---|---|---|
| 10000085207 | 9676830 | 26 |
| 10000085207 | 9676943 | 26 |
| 10000085207 | 9676944 | 26 |
| 10000085207 | 9676947 | 26 |
| 10000085207 | 9677008 | 26 |
e) demographics: household demographic data (age, income, family size, etc.)
| household_id | age | income | home_ownership | marital_status | household_size | household_comp | kids_count |
|---|---|---|---|---|---|---|---|
| 1 | 65+ | 35-49K | Homeowner | Married | 2 | 2 Adults No Kids | 0 |
| 1001 | 45-54 | 50-74K | Homeowner | Unmarried | 1 | 1 Adult No Kids | 0 |
| 1003 | 35-44 | 25-34K | NA | Unmarried | 1 | 1 Adult No Kids | 0 |
| 1004 | 25-34 | 15-24K | NA | Unmarried | 1 | 1 Adult No Kids | 0 |
| 101 | 45-54 | Under 15K | Homeowner | Married | 4 | 2 Adults Kids | 2 |
f) products: product metadata (brand, description, etc.)
| product_id | manufacturer_id | department | brand | product_category | product_type | package_size |
|---|---|---|---|---|---|---|
| 25671 | 2 | GROCERY | National | FRZN ICE | ICE - CRUSHED/CUBED | 22 LB |
| 26081 | 2 | MISCELLANEOUS | National | NA | NA | NA |
| 26093 | 69 | PASTRY | Private | BREAD | BREAD:ITALIAN/FRENCH | NA |
| 26190 | 69 | GROCERY | Private | FRUIT - SHELF STABLE | APPLE SAUCE | 50 OZ |
| 26355 | 69 | GROCERY | Private | COOKIES/CONES | SPECIALTY COOKIES | 14 OZ |
g) promotions_sample: a sampling of the product placement in mailers and in stores corresponding to advertising campaigns
| product_id | store_id | display_location | mailer_location | week |
|---|---|---|---|---|
| 1000050 | 337 | 3 | 0 | 1 |
| 1000092 | 317 | 0 | A | 1 |
| 1000214 | 317 | 6 | 0 | 1 |
| 1000235 | 317 | 0 | A | 1 |
| 1000235 | 337 | 0 | A | 1 |
h) transactions_sample: a sampling of the products purchased by households
| household_id | store_id | basket_id | product_id | quantity | sales_value | retail_disc | coupon_disc | coupon_match_disc | week | transaction_timestamp |
|---|---|---|---|---|---|---|---|---|---|---|
| 2261 | 309 | 31625220889 | 940996 | 1 | 3.86 | 0.43 | 0 | 0 | 5 | 2017-01-28 14:06:53 |
| 2131 | 368 | 32053127496 | 873902 | 1 | 1.59 | 0.90 | 0 | 0 | 10 | 2017-02-28 22:31:57 |
| 511 | 316 | 32445856036 | 847901 | 1 | 1.00 | 0.69 | 0 | 0 | 13 | 2017-03-26 13:22:21 |
| 400 | 388 | 31932241118 | 13094913 | 2 | 11.87 | 2.90 | 0 | 0 | 8 | 2017-02-18 13:13:10 |
| 918 | 340 | 32074655895 | 1085604 | 1 | 1.29 | 0.00 | 0 | 0 | 10 | 2017-03-02 15:05:57 |
4. Data Preparation
4.1 Tables of concern for Problem #1:
- transactions data
- campaign_descriptions data
- coupons_redemption data
- products data
4.2 Data Preparation for insight generation:
a. Joining datasets
- Join the transactions, coupons and campaign_desc tables
- filtering out null values after the join as they represent mismatched timestamps
- transactions displayed where Campaign A was effective VS. transactions without promotion period
transactions <- get_transactions()
promotions <- get_promotions()
transactions <- transactions %>%
mutate(transaction_date = as.Date(transactions$transaction_timestamp))
coupon_redemptions <- coupon_redemptions %>%
left_join(campaign_descriptions, by = "campaign_id") %>%
full_join(coupons, by = c("coupon_upc", "campaign_id"))
transactions_involving_coupons <- transactions %>%
left_join(coupon_redemptions, by = c("household_id", "product_id", "transaction_date" = "redemption_date")) %>%
filter(!is.na(campaign_id))
transactions_involving_coupons%>%
glimpse()
## Rows: 3,538
## Columns: 17
## $ household_id <chr> "1029", "165", "712", "2488", "1923", "1923", "1…
## $ store_id <chr> "438", "321", "448", "324", "369", "369", "369",…
## $ basket_id <chr> "31198980681", "31242775833", "31316965685", "31…
## $ product_id <chr> "9194206", "1090676", "9677486", "12171765", "84…
## $ quantity <dbl> 2, 2, 2, 1, 1, 1, 6, 1, 1, 2, 1, 1, 1, 1, 1, 1, …
## $ sales_value <dbl> 4.28, 3.00, 5.54, 2.00, 0.25, 0.50, 2.60, 2.00, …
## $ retail_disc <dbl> 2.10, 1.58, 1.84, 0.49, 0.19, 0.19, 1.14, 1.99, …
## $ coupon_disc <dbl> 0.00, 1.00, 0.00, 0.50, 0.25, 0.00, 1.00, 1.00, …
## $ coupon_match_disc <dbl> 0.00, 0.00, 0.00, 0.50, 0.25, 0.00, 1.00, 0.00, …
## $ week <int> 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 7, 7, …
## $ transaction_timestamp <dttm> 2017-01-01 11:58:48, 2017-01-03 12:03:19, 2017-…
## $ transaction_date <date> 2017-01-01, 2017-01-03, 2017-01-07, 2017-01-10,…
## $ coupon_upc <chr> "51380041313", "53377610033", "54300016033", "51…
## $ campaign_id <chr> "26", "26", "26", "26", "26", "26", "26", "26", …
## $ campaign_type <ord> Type B, Type B, Type B, Type B, Type B, Type B, …
## $ start_date <date> 2016-12-28, 2016-12-28, 2016-12-28, 2016-12-28,…
## $ end_date <date> 2017-02-19, 2017-02-19, 2017-02-19, 2017-02-19,…
b. Further data preparation for comparative analysis, analyzing customer purchase pattern, and product sales
campaign_df %>%
filter(Campaign == "Not Campaign A") %>%
arrange(desc(Total_Sales)) %>%
head(10) %>%
kbl() %>%
kable_styling()
| basket_id | Campaign | Sales_Volume | Total_Sales |
|---|---|---|---|
| 33015981495 | Not Campaign A | 3 | 36.97 |
| 41479822453 | Not Campaign A | 14 | 34.86 |
| 40300451090 | Not Campaign A | 9 | 33.00 |
| 40853292449 | Not Campaign A | 4 | 31.96 |
| 41338486706 | Not Campaign A | 17 | 29.68 |
| 40800778950 | Not Campaign A | 3 | 27.05 |
| 41109557898 | Not Campaign A | 1 | 24.99 |
| 41259462119 | Not Campaign A | 1 | 24.99 |
| 40955620515 | Not Campaign A | 3 | 23.56 |
| 41479740437 | Not Campaign A | 14 | 22.68 |
campaign_df %>%
filter(Campaign == "Not Campaign A") %>%
summary() %>%
kbl() %>%
kable_styling()
| basket_id | Campaign | Sales_Volume | Total_Sales | |
|---|---|---|---|---|
| Length:155 | Length:155 | Min. : 1.000 | Min. : 0.600 | |
| Class :character | Class :character | 1st Qu.: 1.000 | 1st Qu.: 2.830 | |
| Mode :character | Mode :character | Median : 2.000 | Median : 5.000 | |
| NA | NA | Mean : 2.813 | Mean : 7.176 | |
| NA | NA | 3rd Qu.: 3.000 | 3rd Qu.: 9.040 | |
| NA | NA | Max. :17.000 | Max. :36.970 |
Note
- Total Rows & Columns: 155,4
(respectively)
- Total Sales: Depict the total sales with
corresponding to each basket ID
- Sales Volume: Depict the quantity purchased
corresponding to each basket ID
c. Calculation of Expenditure per Hour:
Expenditure Per Hour (EPH) is a measure coefficient to measure how effective was the sales for a given period of time. The duration can be defined as the time when the promotion campaign was active or the time when there was no promotion active. We have considered EPH on hourly basis to calculate dollar value per hour. EPH is derived by the following formula:
\(\sum (No. of Transactions)/\sum (Total duration)\)
After joining different tables and grouping the aggregation algorithm, we have considered unique households that shopped when Campaign A was active and the same set that shopped where there was no promotion in place. We agreed that EPH would be the best indicator of sales in both the periods (Campaign A active Vs. Not Active) only when we consider unique households that shopped during both the duration.
transactions_eph <- campaign_df_clean %>%
mutate(Campaign = case_when(campaign_type == "Type A" ~ "Campaign A",
TRUE ~ "Not Campaign A"),
Month = month(transaction_timestamp),
Holiday = case_when(Month %in% c(10,11,12,1) ~ "Holiday Season",
TRUE ~ "Not Holiday Season")) %>%
group_by(household_id,Campaign,Holiday,transaction_date) %>%
summarise(Total_Sales = sum(sales_value, na.rm = T)) %>%
group_by(household_id,Campaign,Holiday) %>%
summarise(eph = mean(Total_Sales, na.rm = T)/24) %>%
select(household_id,Campaign,Holiday,eph)
transactions_eph %>%
head(10) %>%
kbl() %>%
kable_styling()
| household_id | Campaign | Holiday | eph |
|---|---|---|---|
| 1 | Campaign A | Holiday Season | 0.9400000 |
| 1 | Campaign A | Not Holiday Season | 0.5216667 |
| 1000 | Campaign A | Not Holiday Season | 0.4616667 |
| 1005 | Campaign A | Not Holiday Season | 0.3818750 |
| 101 | Campaign A | Holiday Season | 0.2725000 |
| 101 | Campaign A | Not Holiday Season | 1.7054167 |
| 1011 | Campaign A | Not Holiday Season | 0.9545833 |
| 1012 | Campaign A | Not Holiday Season | 0.0891667 |
| 1015 | Campaign A | Holiday Season | 0.6512500 |
| 1015 | Campaign A | Not Holiday Season | 0.0854167 |
c. Calculation of Average Basket Value
Average basket Value: Average basket value, also known as units per transaction, refers to the average number of items sold per single transaction. This is calculated by dividing the total value of all transactions by the number of transactions or sales.
\(\sum (Total Value of Transactions)/\sum (NumberofTransactions)\)
d. Campaign Durations
Campaign Durations are listed below:
Campaign A ran for total 196 days. With each duration listed below:
campaign_descriptions %>%
filter(campaign_type == "Type A") %>%
mutate(date_diff = as.numeric(difftime(end_date, start_date, units = "days"))) %>%
kbl() %>%
kable_styling()
| campaign_id | campaign_type | start_date | end_date | date_diff |
|---|---|---|---|---|
| 8 | Type A | 2017-05-08 | 2017-06-25 | 48 |
| 13 | Type A | 2017-08-08 | 2017-09-24 | 47 |
| 18 | Type A | 2017-10-30 | 2017-12-24 | 55 |
| 27 | Type A | 2017-02-08 | 2017-03-26 | 46 |
Campaign A was inactive for all the other days when Campaign A was active.
Campaign A Inactive Duration = 365 - 196 = 169
5. Promotion Analysis
5.1 Analysis
a. Outlier Detection on transactions when Campaign A was Active VS. No Promotion Period
The first plot depicts the total sales distribution by when the Campaign A was running vs. when Campaign A was inactive.
We have also observed some outlier transactions which were exorbitantly large in both the periods. We perceive these transactions which were not driven by any promotion/campaign and could be driven by other factors.
The reader might remember that this dataset in the one we prepared (using wrangling procedures) in our data preparation section.
b. Analyzing total sales in campaign A vs. when campaign A was inactive by holiday season
- Further we have also analyzed the sales during the holiday season, as holiday is one of the biggest drivers of sale. To rule out the hypothesis that higher sales (when campaign A was running) could have been due to the holiday season, we did a comparative analysis of sales in Campaign A during the holiday season and during normal period.
Distribution of total sales by basket id with the median value shown as a white dot. can be seen that the distribution was even throughout the year for campaign A. The distribution when campaign A was not running did not have as strong of a tail.
c. Analyzing sales volume in campaign A vs. when campaign A was inactive by holiday season
Also total sales might not be the correct representation of sales as we are in the promotion period where many products are sold on a discounted price, BOGO or multi-save and conditional promos.
Hence, we did a comparative analysis on Sales Volume (dollar per unit) in Campaign A during the holiday season and during normal period.
Distribution of sales volume by basket id with the median value shown as a white dot. It can be seen that the distribution was even throughout the year for campaign A. The distribution for campaign A was much more spread out than the other campaigns.
e. Analyzing total sales and sales volume when Campaign A was running VS. when Campaign A was inactive
The data was grouped by basket in order to examine the distribution. The data was grouped by campaign in order to examine the difference between campaign A and transactions when campaing A was inactive. Most of the sales volume fall between 0 & 20 with the total sales between $0 & $40.
- The above plot clearly shows that total sales and sales volume was relatively higher while in campaign A as compared to when the promotion was inactive. - The data was grouped by basket ID which is an unique indicator of sales quantity under a transaction
5.2 Measure of Promotion Effectiveness
a. Expenditure by Hour VS. Time of Year
Expenditure Per Hour (EPH) is a measure coefficient to measure how effective was the sales for a given period of time. The duration can be defined as the time when the promotion campaign was active or the time when there was no promotion active. We have considered EPH on hourly basis to calculate dollar value per hour. EPH is derived by the following formula:
\(\sum (No. of Transactions)/\sum (Total duration)\)
The calculation of EPH for the dataset is provided in the previous section - “Data Preparation”
- Generally, a mean EPH of 0.3 and higher for a promotion is
considered to be a “successful promotion”.
- We experienced a mean of 0.53 for Campaign A and 0.13 of mean EPH when campaign A was not running.
## [1] 0.5344225
- Furthermore, we also calculated EPH for transactions in a period where campaign A was not running and we considered both the holiday and the non-holiday season.
- From the above plot it is a clear indication that the major driver of sales were the promotion programs that ran during campaign A. Despite of heavy sales volume experienced in the holiday season, campaign A had a strong mean EPH and higher total sales & sales volume even during the duration when there was no promotion running.
5.2 Demographic Deep-Dive Analysis when Campaign A was active vs. inactive
- Taking a step further, we are analyzing the customer segments that shopped during the campaign period and also when the campaign was inactive.
- We agreed that the customer segmentation would be best observed in both the periods (Campaign A active Vs. Not Active) only when we consider unique households that shopped during both the duration.
- We have considered top 3 income range that had most sales against their household IDs.
The demographics are shown as how the relate to the campaign. The blue lines represent campaign A and the green lines represent sales when campaign was inactive.
5.3 Product Sales Deep-Dive Analysis when Campaign A was active vs. inactive
- Now, once we came to the conclusion that Campaign A was effective in all the sales season, we have further dived a level deeper to analyze the top products that contributed to the sales when Campaign A was active.
- We observed that Milk, Frozen Products and Meat were the top contributors towards sales in both the durations.
Further, to strengthen our assertion about effectiveness of Campaign A on specific products, we have plotted the Average sales volume per basket for Milk, Frozen Products & Meat. These products have been specifically filtered out by COMMON household_ID as they projected highest sales during both the durations.
5.3.1 Average Meat Item Sales Per Basket by Campaign Type & Time of Year
5.3.2 Average Frozen Items Sales Per Basket by Campaign Type & Time of Year
5.3.3 Average Milk Items Sales Per Basket by Campaign Type & Time of Yea
6. Summary & Recommendations
6.1: Interesting Insights:
Consumer spending in Campaign A seems to be highest if consumers are married, have a household size of 2 to 3, earn somewhere between $50-99k, and who purchase heavily in both holiday and non-holiday season.
The important product segments is Meat, Milk & Frozen Items with highest sales. These products have exhibited tremendous sales in both the duration of when the campaign A was active vs. inactive and also during the holiday season vs. non-holiday season.
Mean EPH of 0.3 and higher for a promotion is considered to be a “successful promotion”. We experienced a mean of 0.53 for Campaign A and 0.13 of mean EPH when campaign A was not running.
6.2: Recommendations
Few recommended actions to boost short-term sales during promotion periods are listed below:
1. To maximize customer response:
Collect data from existing members for a certain time period, determine
customer-to-promotion category match for individual customers, and send
single offers accordingly.
2. To predict new customer behavior
Additional data (e.g., occupation, education level, location) might be
requested in the profile section, when a new customer signs up to be a
Regork to test for possible features that might contribute to a
customer-to-promotion match.
3. To maximize success rate of Campaign A:
- Send single offers via social networks and mobile app.
- Offer discount and BOGO promotions with low purchase requirement.